CASE-BASED SONOGRAM CLASSIFICATION


1.  INTRODUCTION  AND  MOTIVATION 


Fala  and  Walker  (1993)  describe  results  of  applying  three  novel  case-based  reasoning  (CBR) 
algorithms  to  a  submarine  classification  task  that  used  data  obtained  from  sonogram  line  readings. 
Although  some  of  their  algorithms  appeared  to  perform  well,  they  did  not  describe  comparisons 
with  alternative  algorithms.  It  is  difficult  to  assess  the  performance  of  their  algorithms  without 
these  comparisons. 

We replicated their studies, included comparisons with several other algorithms from the machine
learning literature, and included two studies suggested as future work by Fala and Walker.
We  discovered  strengths  and  weaknesses  of  their  algorithms.  We  also  found  ways  to  improve  their 
performance.  This  report  details  our  studies  and  summarizes  ways  for  incorporating  additional 
domain-specific  knowledge  into  case-based  classifiers. 

Section  2  details  Fala  and  Walker’s  (1993)  experiments.  Our  results  with  the  same  dataset  are 
described  in  Section  3.  Section  4  discusses  the  ramifications  of  these  results  and  Section  5  provides 
suggestions  on  how  to  incorporate  more  domain-specific  knowledge. 


2.  CONTEXT:  NAWC’S  SONAR  ANALYSIS  SYSTEM 


Fala  and  Walker  (1993)  analyzed  their  CBR  tools'  ability  to  automatically  classify  acoustic  sonar 
images of submarines. Their interviews with experts who do this task (i.e., three aviation
anti-submarine warfare technicians) suggested that experts might use CBR-like classification strategies.
This  motivated  Fala  and  Walker  to  create  a  CBR  system  that  automates  this  process. 

They  began  by  collecting  low-level  features  describing  lines  in  sonogram  readings.  These  were 
used  to  design  a  representation  for  cases.  More  specifically,  21  cases  were  compiled  using  an 
automated  system  for  extracting  lines  from  sonograms.  Each  case  was  then  classified  by  an  expert 
into  one  of  five  possible  classifications  (i.e.,  submarines).  The  raw  cases  contained  between  16  and 
49  lines,  all  taken  at  the  same  noise  levels.  The  frequency  resolution  was  100  and  the  frequencies 
of  the  lines  ranged  between  five  and  400  Hz. 

The raw data were not represented directly in the cases. Instead, Fala and Walker incorporated
the  notion  that  humans  can  visually  separate  lines  as  close  as  1/32  of  an  inch.  Their  representation 
was  based  on  counting,  individually  for  each  sonogram,  the  number  of  lines  per  frequency  boundary. 
The  frequency  boundary  for  a  given  line  was  obtained  using 


$$ \mathrm{frequency}(\mathit{line}) = \mathrm{truncate}\left( \frac{\mathit{line} - 5}{3 + [\text{illegible}]} \right) \tag{1} $$

When applied to the given 21 sonograms, this formula yields frequency boundaries in the range
[0, 127].  Since  this  function  can  map  many  frequencies  to  a  single  frequency  boundary,  the  128 
feature  values  constituting  each  case  were  non-negative  integers. 

Fala  and  Walker  used  the  standard  leave-one-out  strategy  (Weiss  and  Kulikowski  1991)  to 
evaluate  the  classification  accuracy  of  three  novel  variants  of  the  nearest-neighbor  algorithm  (Fix 
and  Hodges  1951;  Cover  and  Hart  1967;  Duda  and  Hart  1973).  This  strategy  simply  includes  all 
but  one  case  in  the  training  set  and  uses  the  remaining  case  as  the  only  one  used  during  testing. 
This  is  repeated  once  for  each  case  in  the  data  set.  Since  one  of  the  cases  had  a  unique  classification 
relative  to  the  remaining  cases,  it  was  not  included  in  their  experiments.  Thus,  only  20  of  the  21 
cases  were  used  in  their  experiments,  which  left  only  four  classes  represented  by  cases  in  the  dataset. 

The nearest-neighbor algorithm has been extensively analyzed in the literature on pattern recognition
(Dasarathy 1991) and machine learning (Aha, Kibler, and Albert 1991), where it is viewed
as  an  instance-based  learning  (IBL)  algorithm.  For  the  purposes  of  this  report,  IBL  algorithms  can 
be  thought  of  as  consisting  of  the  following  three  functions: 

1.  Normalization:  Preprocesses  the  data,  and  is  primarily  used  to  equalize  the  relative  influences 
of  features  in  similarity  computations. 

2.  Similarity:  Used  to  compute  the  similarity  between  two  cases. 

3.  Prediction:  Given  the  results  of  the  similarity  computations,  this  function  details  how  a 
classification  prediction  is  made. 

There  is  no  standard  normalization  function  used  with  the  nearest-neighbor  algorithm.  However, 
it  is  defined  as  using  the  following  Euclidean  (dis)similarity  function  (assuming  F  features  are  used 
to  describe  each  case  x  and  y): 

$$ \mathrm{(dis)Similarity}(x, y) = \sqrt{\sum_{i=1}^{F} (x_i - y_i)^2} \tag{2} $$

The nearest-neighbor prediction function simply predicts that the given case's class is the same
as that of its most similar case (i.e., the least distant case). Several studies on machine learning,
case-based reasoning, statistics, pattern recognition, cognitive psychology, and other topics have
used this algorithm in empirical comparison studies as a straw-man due to its simplicity and
popularity (Sebestyen 1962; Reed 1972; Duda and Hart 1973; Shepard 1983; Breiman, Friedman, Olshen,
and Stone 1984; Kibler and Aha 1987; Bareiss 1989; Dasarathy 1991; Weiss and Kulikowski 1991;
Shavlik, Mooney, and Towell 1991; Aha, Kibler, and Albert 1991). It is well-known that while the
nearest-neighbor algorithm is a relatively robust classifier, its primary drawbacks include an
inability to tolerate irrelevant attributes, large storage requirements, and relatively high computational
complexities for classifying new cases.
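As a minimal sketch of the nearest-neighbor prediction function built on Eq. (2), the following Python fragment may help. The function names, the toy feature vectors, and the class labels are illustrative assumptions, and ties are resolved here by simply taking the first minimum.

```python
import math

def euclidean_distance(x, y):
    """Eq. (2): square root of the summed squared feature differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def nearest_neighbor_predict(stored_cases, new_case):
    """Return the class label of the least distant stored case.

    stored_cases is a list of (feature_vector, label) pairs.
    Ties are resolved by taking the first minimum (an arbitrary choice).
    """
    _, label = min(stored_cases,
                   key=lambda case: euclidean_distance(case[0], new_case))
    return label

# Toy cases (hypothetical feature values and labels).
stored = [([0, 2, 1], "A"), ([3, 0, 0], "B")]
prediction = nearest_neighbor_predict(stored, [0, 1, 1])
```

Every stored case is visited for every prediction, which illustrates the storage and classification costs mentioned above.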

Fala and Walker's (1993) variants of the nearest-neighbor function used no normalization function,
used the nearest-neighbor prediction function, and did not involve repairs to these drawbacks.
Instead, they used novel similarity functions, which they called comparison operators. The
definitions of these three functions, which are listed in Table 1, were derived from their interviews with
expert sonogram classifiers. More specifically, Matches was suggested by experts noting whether a
given frequency boundary contains lines in both sonograms or contains none in both. The Hits
function corresponds to the number of boundaries in the two sonograms that both contain lines;
it sums the number of such lines in each such frequency boundary. Finally, the Misses function
corresponds to the number of boundaries in a stored case's sonogram that are missing lines visible
in the new case.

Table 1 - Similarity Functions Used by Fala and Walker (1993), Where x is the New Case and y is
the Stored Case

NAME       DEFINITION
Matches    $\sum_{i=1}^{F}$ if $(x_i = y_i)$ or $((x_i \geq 1)$ and $(y_i \geq 1))$ then 1 else 0
Hits       $\sum_{i=1}^{F}$ if $((x_i \geq 1)$ and $(y_i \geq 1))$ then $\min(x_i, y_i)$ else 0
Misses     $\sum_{i=1}^{F}$ if $((x_i \geq 1)$ and $(y_i = 0))$ then $x_i$ else 0
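The three comparison operators of Table 1 can be sketched in Python as follows. This is an illustration rather than Fala and Walker's implementation: the function names and example vectors are assumptions, and "contains lines" is written as a positive feature value so that the same test also applies to scaled (fractional) values.

```python
def matches(x, y):
    """Count boundaries where the cases agree: equal values, or both contain lines."""
    return sum(1 for xi, yi in zip(x, y) if xi == yi or (xi > 0 and yi > 0))

def hits(x, y):
    """Sum the smaller value over boundaries where both cases contain lines."""
    return sum(min(xi, yi) for xi, yi in zip(x, y) if xi > 0 and yi > 0)

def misses(x, y):
    """Sum the new case's values over boundaries that are empty in the stored case."""
    return sum(xi for xi, yi in zip(x, y) if xi > 0 and yi == 0)

new_case = [2, 0, 1, 0]     # x: hypothetical line counts for four boundaries
stored_case = [1, 0, 0, 3]  # y: hypothetical stored case
```

Note that Misses is not symmetric: swapping the new and stored cases generally changes its value.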

Innumerable  methods  exist  for  making  predictions  when  there  is  a  tie  among  the  most  similar 
stored cases. Walker (1993) noted that the method used in their study was perfectly optimistic. If
any of the most similar neighbors' classifications matched that of the test case, then the classification
was  deemed  to  be  correct.  We  used  this  same  optimistic  tie-breaking  method  in  our  own  experiments 
with  nearest-neighbor  variants. 
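The leave-one-out procedure with optimistic tie-breaking might be sketched as below. The function names and the toy data are assumptions made for illustration; any similarity function in which larger values mean "more similar" can be passed in.

```python
def leave_one_out_accuracy(cases, similarity):
    """Leave-one-out accuracy with optimistic tie-breaking.

    cases: list of (feature_vector, label) pairs.
    similarity: function(new_case, stored_case) -> number, larger is more similar.
    A test case counts as correct if ANY of its most similar stored cases
    shares its label.
    """
    correct = 0
    for i, (test_x, test_label) in enumerate(cases):
        training = cases[:i] + cases[i + 1:]
        scores = [(similarity(test_x, x), label) for x, label in training]
        best = max(score for score, _ in scores)
        tied_labels = {label for score, label in scores if score == best}
        if test_label in tied_labels:
            correct += 1
    return correct / len(cases)

def negated_city_block(x, y):
    # Hypothetical similarity for the demo: negated sum of absolute differences.
    return -sum(abs(a - b) for a, b in zip(x, y))

# The third case is equally close to an "A" and a "B" neighbor, so the
# optimistic rule counts it as correct.
toy_cases = [([0], "A"), ([2], "B"), ([1], "A")]
accuracy = leave_one_out_accuracy(toy_cases, negated_city_block)
```

Because ties are always resolved in the classifier's favor, accuracies computed this way are upper bounds on what a deterministic tie-breaking rule would achieve.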

Fala and Walker's algorithms compare two sonograms based on their number of lines per frequency
boundary. However, each sonogram involved a different number of line readings (i.e.,
between  16  and  49).  Therefore,  they  scaled  their  cases  by  using  the  percentage  of  lines  in  a  given 
frequency  boundary  rather  than  their  raw  number.  These  case  representations  were  obtained  from 
the  raw  data  by  dividing  each  case's  feature  value  by  the  number  of  total  lines  in  that  case. 
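A small sketch of this scaling step, assuming a case is stored as a list of per-boundary line counts (the function name and the example counts are hypothetical):

```python
def scale_case(raw_counts):
    """Divide each boundary's line count by the case's total number of lines."""
    total = sum(raw_counts)
    if total == 0:
        # Guard for an empty case; this situation is not discussed in the report.
        return [0.0 for _ in raw_counts]
    return [count / total for count in raw_counts]

raw_case = [2, 0, 1, 1]             # hypothetical counts for four boundaries
scaled_case = scale_case(raw_case)  # fractions that sum to 1
```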

Fala and Walker (1993) reported their leave-one-out results using this 128-feature representation
for these three variants of the nearest-neighbor algorithm. The respective classification accuracies
for Matches, Hits, and Misses were 14/20 (70%), 17/20 (85%), and 6/20 (30%). Guessing
randomly among the four classes yields an accuracy of 25%, whereas always guessing the most
frequent class yields an accuracy of 40%. Simply using nearest-neighbor prediction when representing
the cases with their average line reading yields 55%.

3.  FOLLOWUP  STUDIES 

While  the  accuracies  recorded  by  Matches  and  Hits  are  greater  than  that  attainable  by  always 
guessing  the  most  frequent  class  in  the  dataset,  it  is  not  obvious  from  this  study  alone  whether  they 
are  “very  good.”  Our  first  goal  was  to  investigate  this  claim.  We  also  tested  several  alternative 
preprocessing strategies that have been shown to dramatically alter classification performance on
some problems (Aha 1990; Turney 1993). Finally, we investigated Fala and Walker's two suggestions
for  future  work.  Their  first  suggestion  involves  combining  the  effects  of  their  algorithms.  Their 
second  suggestion  involves  examining  the  algorithms'  behavior  when  using  a  more  continuous  case 
representation.  These  studies  are  detailed  in  the  following  subsections. 

3.1  A  Comparison  Study 

We  selected  our  suite  of  algorithms  from  among  several  commonly  known  algorithms  in  the 
pattern  recognition  and  machine  learning  literatures.  In  doing  so,  we  also  replicated  Fala  and 
Walker's experiments with their three case-based learning algorithms.

The first comparison algorithm we included is ¬Misses, which is identical to Misses except
that it negates the computed sums before making classification predictions. We did this because
Misses computes dissimilarities rather than similarities. This can be observed by noting that its
values increase as the number of differing frequency boundaries increases. Fala and Walker's (1993)
empirical results support this: this poor similarity function is outperformed by always guessing the
most frequently occurring class in their dataset.

Table 2 - Leave-One-Out Results on the Sonar Database Using Scaled Data

Classifier             Fala and Walker's (1993) Results   Our Results
random guess           25%                                25%
guess most frequent    40%                                40%
Matches                70%                                70%
Hits                   85%                                85%
Misses                 30%                                30%
¬Misses                                                   75%
Hamming                                                   85%
Euclidean                                                 70%
Cubic                                                     75%
C4.5                                                      60%
CN2                                                       40%
Backprop                                                  85%

Given our familiarity with the nearest-neighbor classifier and its similarity with Fala and
Walker's algorithms, it is natural to ask what accuracies it can attain. Therefore, we tested three
additional variants of this algorithm. These variants differ only in their similarity function. The
Euclidean similarity function shown in Eq. (2) is actually the Minkowski metric with r = 2:

$$ \mathrm{(dis)Similarity}(x, y) = \left( \sum_{i=1}^{F} |x_i - y_i|^r \right)^{1/r} \tag{3} $$

We used the Minkowskian dissimilarity function with r = 1 (Hamming), with r = 2 (Euclidean),
and with r = 3, which we'll refer to as Cubic.
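The three Minkowski variants differ only in the exponent r, as the following sketch shows. The function name and the example vectors are illustrative assumptions. The r = 1 case, which this report calls Hamming, is also widely known as the city-block distance.

```python
def minkowski_distance(x, y, r):
    """Minkowski metric of order r: (sum_i |x_i - y_i|^r)^(1/r)."""
    return sum(abs(xi - yi) ** r for xi, yi in zip(x, y)) ** (1.0 / r)

x, y = [0, 0], [3, 4]
d1 = minkowski_distance(x, y, 1)  # r = 1: sum of absolute differences
d2 = minkowski_distance(x, y, 2)  # r = 2: Euclidean, as in Eq. (2)
d3 = minkowski_distance(x, y, 3)  # r = 3: "Cubic"
```

Larger values of r give more weight to the single largest feature difference, which is one reason the results can be sensitive to r.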

We also tested three common machine learning algorithms: a decision tree inducer named C4.5^1
(Quinlan 1993), a decision rule inducer named CN2^2 (Clark and Niblett 1989; Clark and Boswell
1991), and the Backpropagation algorithm^3 (Rumelhart, McClelland, and the PDP Research Group
1986).

^1 C4.5 gave the same results when run both with and without its post-pruning option in effect.

^2 CN2 was tested on 24 combinations of its parameter settings. This included all four of its error estimating
strategies, three values for its star size (i.e., 3, 5, and 7), and both with and without its maximum class prediction
option. Its chi-square threshold value was always set to 0.

^3 Backprop was tested once for each of 384 combinations of its input parameters, including two methods for
normalizing the input data (i.e., simple linear interval and z-score), four momentum values (i.e., 0.1, 0.4, 0.7, and
0.9), four learning rates (0.01, 0.1, 0.3, and 0.5), four temperatures (0.1, 0.5, 1.0, and 2.0), and three numbers of
hidden units (i.e., 5, 10, and 25). Classification was based on the output node with the highest activation.

Table 2 summarizes the results for the algorithms alongside the results from the original study
and the baseline results for guessing randomly or always guessing the most frequent class in the
dataset. As with the original study, we scaled the data before applying the algorithms. Four
observations are noteworthy:

1. We replicated the original results.

2. As expected, the ¬Misses similarity function easily outperformed Misses and performed
comparatively well with the other functions tested.

3. The Minkowski metric's results were somewhat sensitive to the value of r. The Hamming
distance function performed as well as Hits.

4. The machine learning algorithms fared poorly because they perform comparatively well only
with larger-sized databases. Previous research also suggests that C4.5 and CN2 may not
work well when the data is completely numeric (Aha 1992). As usual, Backprop performed
well primarily because we tested it on a large number of values for its many parameters.

In general, the Matches and Hits algorithms selected by Fala and Walker appear to perform
well relative to this suite of algorithms.

Table 3 - Leave-One-Out Results on the Sonar Database Using Scaled and Unscaled Data

Classifier             With Scaled Data   With Raw Data
random guess           25%                25%
guess most frequent    40%                40%
Matches                70%                70%
Hits                   85%                80%
Misses                 30%                30%
¬Misses                75%                70%
Hamming                85%                70%
Euclidean              70%                75%
Cubic                  75%                65%
C4.5                   60%                65%
CN2                    40%                50%
Backprop               85%                90%

3.2  Alternative  Preprocessing  Strategies 

It is possible that, by using different normalization functions in the case-based classifiers, higher
classification accuracies can be obtained. However, it is not obvious which case representation,
scaled or raw, supports better classification performance.

Although Fala and Walker reasoned that their data should be scaled, it is not obvious what
gains were obtained by doing so. Therefore, we repeated the experiments using the unscaled
case representation. The results are displayed in Table 3. Four of the algorithms recorded lower
accuracies using this representation while three recorded higher accuracies. While Hits's accuracy
decreased slightly, the accuracies of Fala and Walker's other two algorithms (Matches and Misses)
did not change. Therefore, it is not obvious which representation supports better classification
performance. Given this, we retested the case-based classifiers using both case representations in
our next study.

In the previous experiments no normalization function was used in any of the algorithms. We
were curious as to whether improved classification accuracies could be obtained by using
normalization functions. Therefore, we compared our previous results with those obtained using two standard
normalization strategies. The first, named linear interval, normalizes value v of feature f based on
its minimum and maximum across all cases. The normalized value is computed using

$$ \mathrm{Normalize}(f, v) = \frac{v - \mathrm{minimum}(f)}{\mathrm{maximum}(f) - \mathrm{minimum}(f)} \tag{4} $$

The second normalization function we tested is z-score. This function subtracts the feature's
mean value from the feature value and divides by the feature's standard deviation. The results
for all three normalization procedures and both case representations are shown in Table 4.^4 In
summary, the scaled representation supports the highest accuracies, and using the linear interval
normalization function yields lower accuracies than the other two strategies.

Table 4 - Leave-One-Out Results on the Sonar Database Using Scaled and Unscaled Data

                 With Scaled Data              With Raw Data
Classifier       None   Linear   z-score       None   Linear   z-score
Matches          70%                           70%
Hits             85%    75%      85%           80%    75%      70%
Misses           30%    20%      20%           30%    20%      25%
¬Misses          65%, 70%, 70%, 65%, 70% [one entry illegible; column alignment uncertain]
Hamming          85%    70%      55%           70%    60%      60%
Euclidean        70%, [illegible], 80%, 75% [remaining entries illegible; column alignment uncertain]
Cubic            85%, 65%, 75%, 80% [remaining entries illegible; column alignment uncertain]
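The two normalization strategies described above (Eq. (4) and z-score) could be sketched as follows. The function names are assumptions, as are two details that the report does not specify: the fallback value for a constant feature and the use of the population form of the standard deviation.

```python
import statistics

def linear_interval(column):
    """Eq. (4): map a feature's values onto [0, 1] using its minimum and maximum."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant feature; this fallback is an assumption
    return [(v - lo) / (hi - lo) for v in column]

def z_score(column):
    """Subtract the feature's mean and divide by its standard deviation."""
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column)  # population form; the report does not say which
    if sd == 0:
        return [0.0 for _ in column]  # constant feature; this fallback is an assumption
    return [(v - mean) / sd for v in column]

def normalize_cases(cases, normalizer):
    """Apply a per-feature normalizer to a list of equal-length feature vectors."""
    columns = [normalizer(list(col)) for col in zip(*cases)]
    return [list(row) for row in zip(*columns)]

cases = [[0, 4], [2, 4], [4, 4]]  # hypothetical cases with two features
interval_cases = normalize_cases(cases, linear_interval)
z_cases = normalize_cases(cases, z_score)
```

In a leave-one-out setting, one design question is whether the minimum, maximum, mean, and standard deviation are computed over all cases or only over the training cases of each fold; the sketch leaves that choice to the caller.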


3.3  Combining  the  Original  Algorithms 

Fala  and  Walker  (1993)  suggested  combining  the  effects  of  their  algorithms.  There  are  four 
obvious  combinations  to  consider  corresponding  to  combining  pairs  of  the  three  algorithms  or  all  of 
them  at  once.  In  each  case,  the  combined  algorithm  simply  invokes  more  than  one  of  the  similarity 
functions  in  Table  1.  For  example,  when  using  the  three-algorithm  combination,  similarities  among 
pairs  of  cases  are  still  computed  by  summing  up  pairwise  similarities  among  the  features.  Thus,  the 
Matches  component  adds  one  to  the  similarity  when  the  two  values  are  equal  or  both  positive, 
the  Hits  component  then  adds  the  smaller  value  if  both  are  positive,  and  the  Misses  component 
subtracts the test case's value if it is positive and the stored case's value is zero. The other, pairwise
combinations of the three similarity functions are computed similarly, but with only two similarity
components rather than three.
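The combined similarity described above can be sketched as a single pass over the features, with flags selecting which components take part. The function name, the flags, and the example vectors are illustrative assumptions.

```python
def combined_similarity(x, y, use_matches=True, use_hits=True, use_misses=True):
    """Sum the selected per-feature components for new case x and stored case y."""
    total = 0
    for xi, yi in zip(x, y):
        both_present = xi > 0 and yi > 0
        if use_matches and (xi == yi or both_present):
            total += 1            # Matches: equal values, or both contain lines
        if use_hits and both_present:
            total += min(xi, yi)  # Hits: add the smaller value
        if use_misses and xi > 0 and yi == 0:
            total -= xi           # Misses, negated: subtract the new case's value
    return total

x = [2, 0, 1, 0]  # hypothetical new case
y = [1, 0, 0, 3]  # hypothetical stored case
all_three = combined_similarity(x, y)
matches_minus_misses = combined_similarity(x, y, use_hits=False)
```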

We evaluated all four combinations by using both case representations. The results are shown
in Table 5. In this case, two of the algorithms show improvement under some of the normalization
functions. Perfect classification accuracy results for this dataset when combining the Matches
and ¬Misses similarity functions and using no normalization function on the raw data. High
accuracies result under three conditions when combining all three similarity functions. In summary,
combinations including both Matches and ¬Misses yield better performance on this dataset.


^4 The second through fourth columns display the results using scaled case representations. The third column's
results correspond to using a linear interval normalization procedure, while the fourth column's results are from
computing z-score normalizations. The remaining columns include the same results when using unscaled case representations.

Table 5 - Leave-One-Out Results on the Sonar Database Using Scaled and Unscaled Data

Classifier             With Scaled Data | With Raw Data
Matches-Misses         [illegible]
Hits-Misses            85%, [remaining entries illegible]
Matches+Hits           [illegible]
Matches+Hits-Misses    60%, 95%, 75%, [remaining entries illegible]

[Legible entries that cannot be assigned to a row: 80%, 65%, 70%, 80%, 90%, 75%]


Table 6 - Leave-One-Out Results on the Sonar Database Using the 126-Feature "Boundary-Overlapping"
Representation

Columns: With Scaled Data | With Raw Data
Rows: Matches, Hits, Misses, Matches-Misses, Hits-Misses, Matches+Hits, Matches+Hits-Misses,
Hamming, Euclidean, Cubic, C4.5, CN2, Backprop

[Cell values are only partially legible and cannot be assigned to rows reliably. Legible entries,
in extraction order: 85%, 75%, 80%, 80%, 85%, 90%, 65%, 80%, 85%, 90%, 80%, 85%, 85%, 85%, 85%,
70%, 80%, 65%, 70%, 70%, 65%, 70%, 70%, 80%]

3.4  An  Overlapping  Case  Representation 

Fala and Walker (1993) also suggested using an alternative case representation in which the
frequency bounds overlap. This coarse coding representation (Rumelhart, McClelland, and the
PDP Research Group 1986) modifies the original representation via an averaging process. Each
feature in this representation corresponds to a sequence of frequency boundaries rather than to a
single boundary. For example, we used an overlap of three so that the first feature's value is the
sum of the values of the first three frequency boundaries. Similarly, feature i contains the sum of
the values from frequency boundaries i, i+1, and i+2, where i ranges between zero and 125.
Table 6 summarizes the results when using this 126-feature representation.
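The overlapping representation amounts to a sliding-window sum over the original 128 boundaries. A minimal sketch, with a hypothetical function name and example counts:

```python
def overlapping_features(boundaries, overlap=3):
    """Feature i is the sum of boundaries i, i+1, ..., i+overlap-1."""
    n = len(boundaries) - overlap + 1
    return [sum(boundaries[i:i + overlap]) for i in range(n)]

original = [0] * 128  # one value per frequency boundary
original[10] = 2      # hypothetical line counts
original[12] = 1
coarse = overlapping_features(original)
```

With 128 boundaries and an overlap of three, the result has 126 features, and each line contributes to up to three adjacent features.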

In general, this boundary-overlapping representation did not yield higher performances. None
of the accuracies were over 90%. However, it is possible that alternative boundary-overlapping
representations can support higher classification accuracies.

4. DISCUSSION

These results should not be interpreted as an indication that, for example, the combined
algorithms will outperform the others in general. We plainly see that the algorithms' accuracies
vary  depending  on  the  normalization  function  and  case  representation.  Each  algorithm  has  its  own 
classification  bias,  and  that  bias  can  only  be  best  for  a  finite  set  of  databases  (Utgoff  1986;  Schaffer 
1993). 

However,  it  is  interesting  to  note  that  a  cognitively  motivated  similarity  function,  such  as 
the  combination  of  Matches  and  Misses,  was  the  only  similarity  function  that  attained  perfect 
classification  accuracy  on  this  dataset.  Tversky  (1977)  has  argued  for  the  psychological  plausibility 
of  such  similarity  definitions.  His  contrast  model  of  similarity  is  similar  to  the  combination  of 
Matches  and  Misses  in  that  it  is  an  increasing  function  of  the  cases’  commonalities,  a  decreasing 
function  of  their  differences,  and  computes  these  separately.  However,  one  difference  is  that  the 
contrast  model  subtracts  differences  in  both  directions  rather  than  only  one  direction  (i.e.,  test 
value minus the stored value). Thus a variant of Tversky's model would also subtract from the
cumulative  cases’  similarity  the  value  of  the  stored  case’s  feature  whenever  it  was  positive  and 
the  test  case’s  value  for  that  feature  was  zero.  We  extended  Misses  to  include  this  property  and 
replicated  the  experiments.  The  results  were  similar  to  those  previously  reported. 
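The two-directional extension of Misses described above might look like the following sketch. The function name and example vectors are assumptions; the returned penalty would be subtracted from the cumulative similarity, as in the combined classifiers.

```python
def symmetric_misses(x, y):
    """Penalty for lines present in one case but absent from the other, in both directions."""
    penalty = 0
    for xi, yi in zip(x, y):
        if xi > 0 and yi == 0:
            penalty += xi  # original Misses: line in the new case only
        elif yi > 0 and xi == 0:
            penalty += yi  # added term: line in the stored case only
    return penalty

x = [2, 0, 1, 0]  # hypothetical new case
y = [1, 0, 0, 3]  # hypothetical stored case
penalty_xy = symmetric_misses(x, y)
```

Unlike the original Misses, this penalty is the same whichever case is treated as the new one.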

One interesting avenue for research concerns evaluating the generality of the hybrid Matches +
Misses classifier. While it performed well for this application, we would like to determine whether
it has general benefits in comparison to more established algorithms.

While  the  performance  of  some  individual  algorithms  improved  when  using  the  alternative 
normalization  functions,  in  general  normalization  did  not  improve  classification  performance.  This 
information  is  still  useful  in  that  it  tells  us  that  the  good  performance  of  Fala  and  Walker’s  classifiers 
is not primarily due to anomalies of the initial representation. After all, Matches performed
equally well under all of the normalization methods that were tested while the performance of the
other two algorithms decreased slightly when using the linear interval and z-score functions.

Naturally, there are several questions not addressed by this research, such as the relationship
between the similarity function and the case representation and their contribution towards
performance. Studying this requires varying one component of the case-based classifier while controlling
the selection of the other components. For example, we would like to understand the comparative
limitations of case-based classifiers and other approaches. This is partially addressed elsewhere
(Aha 1992), but these experiments are beyond the scope of this report.

5.  INCORPORATING  DOMAIN  KNOWLEDGE 

Although  it  is  comforting  that  perfect  classification  accuracy  could  be  achieved  on  this  sonogram 
dataset  via  a  combination  of  the  Matches  and  Misses  algorithms,  we  do  not  know  whether  this 
result  will  scale  up.  That  is,  the  current  dataset  is  quite  small  -  only  20  cases  -  and  this  hybrid 
algorithm  may  not  perform  well  on  larger  datasets. 

The  algorithms  we  have  described  so  far  are  knowledge-poor  in  that  they  use  only  a  minimum 
of  domain-specific  knowledge.  In  practical  applications,  there  is  no  substitute  for  such  knowledge, 
and  we  believe  that  to  attain  equally  good  classification  accuracies  with  larger  sonogram  datasets, 
more knowledge-intensive case-based classifiers will be required.

There are at least five ways to incorporate domain-specific knowledge into a case-based classifier:

1.  Use  a  more  appropriate  case  representation, 

2.  Improve  the  normalization  function, 

3. Improve the similarity function,

4.  Improve  the  prediction  function,  and 

5.  Add  a  postprocessing  function. 

Several  of  these  methods  have  overlapping  effects.  For  example,  some  similarity  functions  effectively 
modify  the  case  representation. 

5.1  Case  Representations 

The  simplest  and  most  effective  way  to  improve  classification  performance  often  involves  using 
a  better  case  representation  for  the  given  task.  This  relies  on  having  experts  available  to  suggest 
alternative representations. For example, Fala and Walker (1993) suggested that experts classify
sonogram  readings  using  only  a  subset  of  the  sonogram  rather  than  its  entirety.  Thus,  perhaps 
these  cases  could  more  profitably  be  represented  using  only  a  subset  of  their  features. 

Alternatively,  if  there  is  sufficient  data  and  expertise  available  to  statistically  analyze  the  data, 
then  a  data  modelling  approach  could  be  used  to  repeatedly  propose  and  test  alternative  case 
representations. 

A  final  alternative  is  to  have  the  algorithm  itself  propose  alternative  case  representations.  In  the 
machine  learning  literature,  several  algorithms  implementing  feature  construction  and  constructive 
induction  have  been  used  to  modify  the  given  case  representation  (Birnbaum  and  Collins  1991). 
While few such algorithms have been described for use with case-based classifiers (e.g., Aha 1991),
constructive induction algorithms have greatly improved classification behavior on a limited set of
applications. Future research should include an investigation to determine whether they are useful
for similar sonogram classification tasks.


5.2  Normalization  Functions 

Although we examined three simple normalization strategies in this report (i.e., none, linear
interval,  and  z-score),  they  are  certainly  knowledge-poor.  Recently,  Turney  (1993)  proposed  using 
several  contextual  normalization  functions  to  exploit  the  context  of  the  application.  One  of  Turney’s 
approaches  normalizes  data  by  estimating  each  feature's  expected  value  and  variance  using  some 
standard  prediction  function  on  “healthy  baseline”  data  and  then  normalizes  data  using  a  function 
of  these  estimates.  His  algorithm  improved  the  accuracy  of  a  simple  case-based  classifier  by  13% 
on  a  gas  turbine  classification  task.  Future  work  should  include  studying  whether  similar  functions 
could  be  used  for  sonogram  classification  tasks. 

5.3 Similarity Functions

As mentioned previously, a primary weakness of the nearest-neighbor function is that it is
sensitive to the presence of irrelevant features in the case representation. This is because its
similarity function, the Euclidean distance metric, assumes that all features are equally relevant.
That is, each feature has equal impact on similarity computations.

Dynamic  feature  selection  algorithms  alleviate  this  problem.  Most  of  them  assign  weights  to 
each  feature.  The  most  relevant  features  are  assigned  the  highest  weights.  For  example,  a  typical 
weighted-Euclidean similarity function is

$$ \mathrm{(dis)Similarity}(x, y) = \sqrt{\sum_{i=1}^{F} w_i \times (x_i - y_i)^2}, \tag{5} $$

where w_i is the weight of feature i. Using this function, features with weights of zero are effectively
ignored during similarity computations, whereas features whose weights are high have the most
impact on determining similarity. Several weight-learning methods have been proposed, including
algorithms based on incremental training (Salzberg 1990; Aha 1989), genetic algorithms (Kelly and
Davis 1991), decision trees (Kibler and Aha 1987; Cardie 1993), information theory (Bakiri 1991),
ones for symbolic-valued attributes (Stanfill and Waltz 1986), and several others.
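A weighted distance in the style of Eq. (5) can be sketched as below. The function name, weights, and example vectors are illustrative assumptions, and the square root is included here for consistency with Eq. (2). How the weights are learned is outside the scope of this sketch.

```python
import math

def weighted_euclidean(x, y, weights):
    """Each squared feature difference is scaled by that feature's weight."""
    return math.sqrt(sum(w * (xi - yi) ** 2 for w, xi, yi in zip(weights, x, y)))

x, y = [1, 10], [4, 50]                      # hypothetical cases with two features
d_equal = weighted_euclidean(x, y, [1, 1])   # both features count equally
d_ignore = weighted_euclidean(x, y, [1, 0])  # second feature is ignored
```

Setting a weight to zero removes that feature's influence entirely, which is how such a function can suppress an irrelevant feature that would otherwise dominate the distance.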

All of these algorithms can be run in a knowledge-poor fashion. However, knowledge-intensive
algorithms are often more appropriate, especially when only a small amount of data is available
for an application with a large instance space. For example, Cain, Pazzani, and Silverstein (1991)
demonstrated that a simple set of explanation-based learning trees can be used to determine, for
each case, which attributes are relevant for their application. By using this additional knowledge,
they increased their classification accuracy by 18% in their database on foreign trade negotiations.
Future research should include analyzing how well similar algorithms improve performance on
sonogram  classification  tasks. 

Jabbour et al. (1987) have published studies on yet another method for
incorporating knowledge into a similarity function. Their power load forecasting system, ALFA, uses
an eight-nearest-neighbor function to predict power load for the Niagara Mohawk Power Company
(NIMO) of central New York State. Cases consist of meteorological data from three cities in New
York. To prevent similarities from being computed on possibly misleading cases, thresholds are
placed on the tolerated amount of difference allowed on the day of week, hour of day, and month
of year features. If these thresholds are not met, then the similarity between two cases is deemed
to  be  zero  (i.e.,  effectively,  similarities  are  not  computed  for  large  portions  of  their  huge  database). 
Similar  domain-specific  thresholds  may  prove  useful  for  sonogram  classification  tasks. 
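
The thresholding idea can be sketched as follows. This is not ALFA's implementation: the feature names, threshold values, and base similarity function are hypothetical, and the sketch ignores the cyclic nature of features such as hour of day.

```python
def gated_similarity(case_a, case_b, thresholds, base_similarity):
    """Return 0.0 when any gated feature differs by more than its threshold."""
    for feature, limit in thresholds.items():
        if abs(case_a[feature] - case_b[feature]) > limit:
            return 0.0  # cases are deemed incomparable
    return base_similarity(case_a, case_b)

def temperature_similarity(a, b):
    """Hypothetical base similarity computed on a single temperature feature."""
    return 1.0 / (1.0 + abs(a["temp"] - b["temp"]))

# Hypothetical thresholds: cases must lie within one hour and one month.
thresholds = {"hour": 1, "month": 1}
query = {"hour": 14, "month": 7, "temp": 30.0}
near = {"hour": 15, "month": 7, "temp": 29.0}
far = {"hour": 3, "month": 7, "temp": 30.0}

near_score = gated_similarity(query, near, thresholds, temperature_similarity)
far_score = gated_similarity(query, far, thresholds, temperature_similarity)
```

The case `far` has the same temperature as the query, yet it receives a similarity of zero because its hour of day falls outside the tolerated difference.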


5.4  Prediction  Functions 

The only prediction function that we have discussed has been the single nearest-neighbor
function. Alternative functions should be considered in future research tasks. The most obvious
alternative is k-nearest-neighbor where k > 1. Many studies have suggested that its bias is
beneficial, and it is well-known that linear increases in k yield exponential decreases in the difference
between the learning rates of k-nearest-neighbor and the Bayes optimal learner (Cover and Hart
1967; Cover 1968; Duda and Hart 1973).
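
A minimal sketch of a k-nearest-neighbor prediction function with majority voting is given below. It assumes unweighted Euclidean distance and a toy two-feature case library; the class labels and feature values are hypothetical and are not drawn from the sonogram dataset.

```python
import math
from collections import Counter

def knn_predict(query, cases, k):
    """Predict the majority label among the k stored cases nearest to query.

    cases is a list of (feature_vector, label) pairs.
    """
    ranked = sorted(cases, key=lambda case: math.dist(query, case[0]))
    votes = Counter(label for _, label in ranked[:k])
    # Among equally frequent labels, the one seen first (nearest) is returned.
    return votes.most_common(1)[0][0]

# Hypothetical case library in which one "B" case sits among the "A" cases.
cases = [
    ((0.0, 0.0), "A"),
    ((0.3, 0.0), "A"),
    ((0.1, 0.1), "B"),
    ((1.0, 1.0), "B"),
]
query = (0.1, 0.05)
single_nn = knn_predict(query, cases, k=1)
three_nn = knn_predict(query, cases, k=3)
```

With k = 1 the prediction follows the single closest case, whereas with k = 3 the two neighboring "A" cases outvote it, which illustrates how larger values of k can reduce sensitivity to isolated cases.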
Additionally, when k > 1, alternative similarity functions should be considered, especially those
in  which  similarity  decreases  exponentially  with  distance  (Nosofsky  1986;  Hintzman  1988;  Aha 
and  Goldstone  1992).  These  studies  on  human  concept  formation  were  all  motivated  by  Shepard’s 
(1987)  findings  that  subjects  tend  to  generalize  two  stimuli  based  on  an  exponentially  decreasing 
function  of  the  stimuli's  distance  in  a  psychological  space.  This  observation  may  soon  prove  useful 
for  improving  the  performance  of  automated  classification  algorithms. 
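
One way such a function might be used is sketched below: each of the k nearest cases votes with a weight that decays exponentially with its distance from the query. The decay rate, the use of Euclidean distance, and the example cases are assumptions made for illustration; they are not taken from the cited psychological models.

```python
import math
from collections import defaultdict

def exp_similarity(x, y, decay=1.0):
    """Similarity that decreases exponentially with Euclidean distance."""
    return math.exp(-decay * math.dist(x, y))

def weighted_knn_predict(query, cases, k, decay=1.0):
    """Sum exponential similarities per label over the k nearest cases."""
    ranked = sorted(cases, key=lambda case: math.dist(query, case[0]))
    totals = defaultdict(float)
    for features, label in ranked[:k]:
        totals[label] += exp_similarity(query, features, decay)
    return max(totals, key=totals.get)

# Hypothetical cases: one very close "A" case and two distant "B" cases.
cases = [((0.1, 0.0), "A"), ((1.0, 0.0), "B"), ((1.2, 0.0), "B")]
prediction = weighted_knn_predict((0.0, 0.0), cases, k=3, decay=2.0)
```

In this example an unweighted majority vote over three neighbors would favor "B", but the exponentially decaying weights let the single close "A" case dominate.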

5.5  Postprocessing  Functions 

Finally,  postprocessing  functions  have  also  been  shown  to  improve  the  performance  of  case- 
based  classifiers.  For  example,  after  ALFA  generated  a  prediction,  it  consulted  a  set  of  rules  to 
adjust  for  annual  population  drift  and  account  for  days  on  which  power  load  requirements  would 
differ  greatly  from  their  norm  (e.g.,  Super  Bowl  Sunday).  This  allowed  ALFA  to  attain  predictive 
accuracies  similar  to  those  attained  by  NIMO's  experts. 

Another  way  to  incorporate  knowledge  during  postprocessing  was  demonstrated  in  CABERESS 
by  Clark,  Feng,  and  Matwin  (1993).  They  simply  averaged  the  predictions  from  a  case-based 
classifier  with  those  derived  from  a  domain-specific  model  of  their  classification  task.  Similar 
approaches  should  be  useful  for  sonogram  classification  studies. 
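
The averaging step might be sketched as follows, under the assumption that both the case-based classifier and the domain-specific model output a numeric score per class. The score dictionaries and class labels are hypothetical, and this is not a description of how CABERESS itself is implemented.

```python
def average_predictions(case_based, model_based):
    """Average per-class scores from two predictors; missing classes score 0.0."""
    labels = set(case_based) | set(model_based)
    return {
        label: (case_based.get(label, 0.0) + model_based.get(label, 0.0)) / 2.0
        for label in labels
    }

# Hypothetical scores: the case-based classifier is undecided, and the
# domain model tips the combined prediction toward class "B".
case_based_scores = {"A": 0.5, "B": 0.5}
model_scores = {"A": 0.25, "B": 0.75}
combined = average_predictions(case_based_scores, model_scores)
predicted_label = max(combined, key=combined.get)
```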

6.  CONCLUSION 

This report describes followup studies to Fala and Walker’s (1993) study of three CBR
algorithms' ability to classify sonar data. We replicated their experiments, extended them, compared
their  results  with  those  from  several  other  algorithms,  and  investigated  other  representations  and 
normalization  functions.  We  also  tested  Fala  and  Walker’s  suggestion  to  combine  their  similarity 
functions  and  found  that,  under  some  conditions,  perfect  or  near-perfect  classification  performance 
could  be  obtained  when  using  their  algorithms.  Our  future  interests  include  investigating  whether 
existing knowledge-intensive learning strategies for case-based reasoners can improve performance
on  more  challenging  sonogram  classification  tasks.  Therefore,  we  outlined  many  possible  ways  to 
explore  these  issues. 